Abstract

There is rising interest in vector-space word embeddings and their use in
NLP, especially given recent methods for their fast estimation at very large
scale. Nearly all this work, however, assumes a single vector per word type,
ignoring polysemy and thus jeopardizing their usefulness for downstream
tasks. We present an extension to the Skip-gram model that efficiently learns
multiple embeddings per word type. It differs from recent related work by
jointly performing word sense discrimination and embedding learning, by
non-parametrically estimating the number of senses per word type, and by its
efficiency and scalability. We present new state-of-the-art results in the
word similarity in context task and demonstrate its scalability by training
with one machine on a corpus of nearly 1 billion tokens in less than 6 hours.

1 Introduction 

Representing words by dense, real-valued vector embeddings, also commonly
called “distributed representations,” helps address the curse of
dimensionality and improve generalization because they can place near each
other words having similar semantic and syntactic roles. This has been shown
dramatically in state-of-the-art results on language modeling (Bengio et al.,
2003; Mnih and Hinton, 2007) as well as improvements in other natural
language processing tasks (Collobert and Weston, 2008; Turian et al., 2010).
Substantial benefit arises when embeddings can be trained on large volumes of
data. Hence the recent considerable interest in the CBOW and Skip-gram models
of Mikolov et al. (2013a; 2013b), relatively simple log-linear models that
can be trained to produce high-quality word embeddings on the entirety of
English Wikipedia text in less than half a day on one machine.

The first two authors contributed equally to this paper.

There is rising enthusiasm for applying these models to improve accuracy in
natural language processing, much like Brown clusters (Brown et al., 1992)
have become common input features for many tasks, such as named entity
extraction (Miller et al., 2004; Ratinov and Roth, 2009) and parsing (Koo et
al., 2008; Täckström et al., 2012). In comparison to Brown clusters, the
vector embeddings have the advantages of substantially better scalability in
their training, and intriguing potential for their continuous and
multi-dimensional interrelations. In fact, Passos et al. (2014) present new
state-of-the-art results in CoNLL 2003 named entity extraction by directly
inputting continuous vector embeddings obtained by a version of Skip-gram
that injects supervision with lexicons. Similarly, Bansal et al. (2014) show
results in dependency parsing using Skip-gram embeddings. They have also
recently been applied to machine translation (Zou et al., 2013; Mikolov et
al., 2013c).

A notable deficiency in this prior work is that each word type (e.g. the word
string plant) has only one vector representation: polysemy and homonymy are
ignored. This results in the word plant having an embedding that is
approximately the average of its different contextual semantics relating to
biology, placement, manufacturing and power generation. In moderately
high-dimensional spaces a vector can be relatively “close” to multiple
regions at a time, but this does not negate the unfortunate influence of the
triangle inequality^1 here: words that are not synonyms but are synonymous
with different senses of the same word will be pulled together. For example,
pollen and refinery will be inappropriately pulled to a distance not more
than the sum of the distances plant-pollen and plant-refinery. Fitting the
constraints of legitimate continuous gradations of semantics is challenge
enough without the additional encumbrance of these illegitimate triangle
inequalities.

^1 For distance d, d(a, c) ≤ d(a, b) + d(b, c).

Discovering embeddings for multiple senses per word type is the focus of work
by Reisinger and Mooney (2010a) and Huang et al. (2012). They both
pre-cluster the contexts of a word type’s tokens into discriminated senses,
use the clusters to re-label the corpus’ tokens according to sense, and then
learn embeddings for these re-labeled words. The second paper improves upon
the first by employing an earlier pass of non-discriminated embedding
learning to obtain vectors used to represent the contexts. Note that by
pre-clustering, these methods lose the opportunity to jointly learn the
sense-discriminated vectors and the clustering. Other weaknesses include
their fixed number of senses per word type, and the computational expense of
the two-step process: the Huang et al. (2012) method took one week of
computation to learn multiple embeddings for a 6,000-word subset of the
100,000-word vocabulary on a corpus containing close to a billion tokens.^2

This paper presents a new method for learning vector-space embeddings for
multiple senses per word type, designed to provide several advantages over
previous approaches. (1) Sense-discriminated vectors are learned jointly with
the assignment of token contexts to senses; thus we can use the emerging
sense representation to more accurately perform the clustering. (2) A
non-parametric variant of our method automatically discovers a varying number
of senses per word type. (3) Efficient online joint training makes it fast
and scalable. We refer to our method as Multiple-sense Skip-gram, or MSSG,
and its non-parametric counterpart as NP-MSSG.

Our method builds on the Skip-gram model (Mikolov et al., 2013a), but
maintains multiple vectors per word type. During online training with a
particular token, we use the average of its context words’ vectors to select
the token’s sense that is closest, and perform a gradient update on that
sense. In the non-parametric version of our method, we build on facility
location (Meyerson, 2001): a new cluster is created with probability
proportional to the distance from the context to the nearest sense.

^2 Personal communication with authors Eric H. Huang and Richard Socher.

We present experimental results demonstrating the benefits of our approach.
We show qualitative improvements over single-sense Skip-gram and Huang et al.
(2012), comparing against word neighbors from our parametric and
non-parametric methods. We present quantitative results in three tasks. On
both the SCWS and WordSim353 data sets our methods surpass the previous
state-of-the-art. The Google Analogy task is not especially well-suited for
word-sense evaluation since its lack of context makes selecting the sense
difficult; however, our method dramatically outperforms Huang et al. (2012)
on this task. Finally, we also demonstrate scalability, learning multiple
senses while training on nearly a billion tokens in less than 6 hours, a 27x
improvement on Huang et al.

2 Related Work 

Much prior work has focused on learning vector representations of words; here
we will describe only those most relevant to understanding this paper. Our
work is based on neural language models, proposed by Bengio et al. (2003),
which extend the traditional idea of n-gram language models by replacing the
conditional probability table with a neural network, representing each word
token by a small vector instead of an indicator variable, and estimating the
parameters of the neural network and these vectors jointly. Since the Bengio
et al. (2003) model is quite expensive to train, much research has focused on
optimizing it. Collobert and Weston (2008) replace the max-likelihood
character of the model with a max-margin approach, where the network is
encouraged to score the correct n-grams higher than randomly chosen incorrect
n-grams. Mnih and Hinton (2007) replace the global normalization of the
Bengio model with a tree-structured probability distribution, and also
consider multiple positions for each word in the tree.

More relevantly, Mikolov et al. (2013a) and Mikolov et al. (2013b) propose
extremely computationally efficient log-linear neural language models by
removing the hidden layers of the neural networks and training from larger
context windows with very aggressive subsampling. The goal of the models in
Mikolov et al. (2013a) and Mikolov et al. (2013b) is not so much obtaining a
low-perplexity language model as learning word representations which will be
useful in downstream tasks. Neural networks or log-linear models also do not
appear to be necessary to learn high-quality word embeddings, as Dhillon and
Ungar (2011) estimate word vector representations using Canonical Correlation
Analysis (CCA).

Word vector representations or embeddings have been used in various NLP tasks
such as named entity recognition (Neelakantan and Collins, 2014; Passos et
al., 2014; Turian et al., 2010), dependency parsing (Bansal et al., 2014),
chunking (Turian et al., 2010; Dhillon and Ungar, 2011), sentiment analysis
(Maas et al., 2011), paraphrase detection (Socher et al., 2011) and learning
representations of paragraphs and documents (Le and Mikolov, 2014). The word
clusters obtained from Brown clustering (Brown et al., 1992) have similarly
been used as features in named entity recognition (Miller et al., 2004;
Ratinov and Roth, 2009) and dependency parsing (Koo et al., 2008), among
other tasks.

There is considerably less prior work on learning multiple vector
representations for the same word type. Reisinger and Mooney (2010a)
introduce a method for constructing multiple sparse, high-dimensional vector
representations of words. Huang et al. (2012) extend this approach,
incorporating global document context to learn multiple dense,
low-dimensional embeddings by using recursive neural networks. Both methods
perform word sense discrimination as a pre-processing step by clustering
contexts for each word type, making training more expensive. While methods
such as those described in Dhillon and Ungar (2011) and Reddy et al. (2011)
use token-specific representations of words as part of the learning
algorithm, the final outputs are still one-to-one mappings between word types
and word embeddings.

3 Background: Skip-gram model 

The Skip-gram model learns word embeddings such that they are useful in
predicting the surrounding words in a sentence. In the Skip-gram model,
$v(w) \in \mathbb{R}^d$ is the vector representation of the word $w \in W$,
where $W$ is the word vocabulary and $d$ is the embedding dimensionality.

Given a pair of words $(w_t, c)$, the probability that the word $c$ is
observed in the context of word $w_t$ is given by

$$P(D = 1 \mid v(w_t), v(c)) = \frac{1}{1 + e^{-v(w_t)^T v(c)}} \qquad (1)$$

The probability of not observing word $c$ in the context of $w_t$ is given by

$$P(D = 0 \mid v(w_t), v(c)) = 1 - P(D = 1 \mid v(w_t), v(c))$$
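As a small illustration of Equation 1, the sketch below evaluates the two probabilities for a pair of toy vectors. The function names and the example vectors are hypothetical and chosen only for this illustration; they are not taken from the authors' code.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def p_observed(v_word, v_context):
    """P(D = 1 | v(w_t), v(c)): sigmoid of the inner product (Equation 1)."""
    return sigmoid(dot(v_word, v_context))


def p_not_observed(v_word, v_context):
    """P(D = 0 | v(w_t), v(c)) = 1 - P(D = 1 | v(w_t), v(c))."""
    return 1.0 - p_observed(v_word, v_context)


# Toy vectors (assumed values); their inner product is exactly 0.
v_w = [0.5, -0.25, 1.0]
v_c = [1.0, 2.0, 0.0]
print(p_observed(v_w, v_c), p_not_observed(v_w, v_c))
```

Orthogonal vectors give a probability of one half; the more aligned the two vectors are, the closer the probability of observing the pair gets to one.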

Given a training set containing the sequence of word types
$w_1, w_2, \ldots, w_T$, the word embeddings are learned by maximizing the
following objective function:

$$J(\theta) = \sum_{(w_t, c_t) \in D^+} \sum_{c \in c_t} \log P(D = 1 \mid v(w_t), v(c)) + \sum_{(w_t, c'_t) \in D^-} \sum_{c' \in c'_t} \log P(D = 0 \mid v(w_t), v(c'))$$

where $w_t$ is the $t^{th}$ word in the training set, $c_t$ is the set of
observed context words of word $w_t$ and $c'_t$ is the set of randomly
sampled, noisy context words for the word $w_t$. $D^+$ consists of the set of
all observed word-context pairs $(w_t, c_t)$ ($t = 1, 2, \ldots, T$). $D^-$
consists of pairs $(w_t, c'_t)$ ($t = 1, 2, \ldots, T$) where $c'_t$ is the
set of randomly sampled, noisy context words for the word $w_t$.

For each training word $w_t$, the set of context words
$c_t = \{w_{t-R_t}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+R_t}\}$ includes
$R_t$ words to the left and right of the given word as shown in Figure 1.
$R_t$ is the window size considered for the word $w_t$, uniformly randomly
sampled from the set $\{1, 2, \ldots, N\}$, where $N$ is the maximum context
window size.
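The window construction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_context` is hypothetical, and clipping the window at the start and end of the token list is an assumption, since the text does not say how boundaries are handled.

```python
import random


def sample_context(tokens, t, max_window, rng=random):
    """Draw R_t uniformly from {1, ..., N} and return (R_t, context words).

    The context holds up to R_t tokens on each side of position t and
    excludes the token at position t itself. Clipping at the sequence
    boundaries is an assumption of this sketch.
    """
    r_t = rng.randint(1, max_window)  # inclusive on both ends
    left = tokens[max(0, t - r_t):t]
    right = tokens[t + 1:t + 1 + r_t]
    return r_t, left + right


tokens = "the plant produces pollen in spring".split()
r_t, context = sample_context(tokens, 2, 5, random.Random(0))
print(r_t, context)
```

Because $R_t$ is resampled for every token, nearby words end up in the context more often than distant ones.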

The set of noisy context words $c'_t$ for the word $w_t$ is constructed by
randomly sampling $S$ noisy context words for each word in the context $c_t$.
The noisy context words are randomly sampled from the following distribution,

$$P(w) = \frac{p_{unigram}(w)^{3/4}}{Z}$$

where $p_{unigram}(w)$ is the unigram distribution of the words and $Z$ is
the normalization constant.
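The noise distribution above can be sketched in a few lines. The helper names are hypothetical, and the sketch assumes that raising raw counts to the power 3/4 and normalizing is equivalent to using unigram probabilities, which holds because the constant factor cancels in $Z$.

```python
import random
from collections import Counter


def noise_distribution(tokens, power=0.75):
    """Unigram distribution raised to `power` and renormalized by Z."""
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())  # normalization constant Z
    return {w: weight / z for w, weight in weights.items()}


def sample_noise(dist, context, num_samples, rng=random):
    """Draw S = num_samples noisy words for each observed context word."""
    words = list(dist)
    probs = [dist[w] for w in words]
    return rng.choices(words, weights=probs, k=num_samples * len(context))


# Toy corpus (assumed): "a" occurs 16 times, "b" once.
dist = noise_distribution(["a"] * 16 + ["b"])
noisy = sample_noise(dist, ["x", "y"], 3, random.Random(1))
print(dist, noisy)
```

In this toy corpus the frequent word receives probability 8/9 instead of its unigram probability 16/17, so the exponent shifts some mass toward rarer words.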

4 Multi-Sense Skip-gram (MSSG) model 

To extend the Skip-gram model to learn multiple embeddings per word we follow
previous work (Huang et al., 2012; Reisinger and Mooney, 2010a) and let each
sense of a word have its own embedding, and induce the senses by clustering
the embeddings of the context words around each token. The vector
representation of the context is the average of its context words’ vectors.
For every word type, we maintain clusters of its contexts and the sense of a
word token is predicted as the cluster that is closest to its context
representation. After predicting the sense of a word token, we perform a
gradient update on the embedding of that sense. The crucial difference from
previous approaches is that word sense discrimination and learning embeddings
are performed jointly by predicting the sense of the word using the current
parameter estimates.

Figure 1: Architecture of the Skip-gram model with window size $R_t = 2$.
Context $c_t$ of word $w_t$ consists of $w_{t-1}, w_{t-2}, w_{t+1}, w_{t+2}$.

In the MSSG model, each word $w \in W$ is associated with a global vector
$v_g(w)$ and each sense of the word has an embedding (sense vector)
$v_s(w, k)$ ($k = 1, 2, \ldots, K$) and a context cluster with center
$\mu(w, k)$ ($k = 1, 2, \ldots, K$). The $K$ sense vectors and the global
vectors are of dimension $d$ and $K$ is a hyperparameter.

Consider the word $w_t$ and let
$c_t = \{w_{t-R_t}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+R_t}\}$ be the set
of observed context words. The vector representation of the context is
defined as the average of the global vector representation of the words in
the context. Let
$v_{context}(c_t) = \frac{1}{2 R_t} \sum_{c \in c_t} v_g(c)$ be the vector
representation of the context $c_t$. We use the global vectors of the context
words instead of their sense vectors to avoid the computational complexity
associated with predicting the sense of the context words. We predict $s_t$,
the sense of word $w_t$ when observed with context $c_t$, as the context
cluster membership of the vector $v_{context}(c_t)$ as shown in Figure 2.
More formally,

$$s_t = \arg\max_{k = 1, 2, \ldots, K} sim(\mu(w_t, k), v_{context}(c_t)) \qquad (3)$$

Figure 2: Architecture of the Multi-Sense Skip-gram (MSSG) model with window
size $R_t = 2$ and $K = 3$. Context $c_t$ of word $w_t$ consists of
$w_{t-1}, w_{t-2}, w_{t+1}, w_{t+2}$. The sense is predicted by finding the
cluster center of the context that is closest to the average of the context
vectors.

The hard cluster assignment is similar to the k-means algorithm. The cluster
center is the average of the vector representations of all the contexts which
belong to that cluster. For sim we use cosine similarity in our experiments.
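The sense prediction step can be sketched as a nearest-center lookup under cosine similarity. All names here are hypothetical. Two conventions are assumptions of this sketch rather than statements from the paper: the context average divides by the number of context words actually present (equal to $2R_t$ when the window is not clipped), and a zero vector is given similarity 0 to everything.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def context_vector(global_vectors, context_words):
    """Average of the global vectors of the context words."""
    dim = len(global_vectors[context_words[0]])
    total = [0.0] * dim
    for word in context_words:
        for i, x in enumerate(global_vectors[word]):
            total[i] += x
    return [x / len(context_words) for x in total]


def predict_sense(cluster_centers, v_context):
    """Index k maximizing sim(mu(w_t, k), v_context), as in Equation 3."""
    sims = [cosine(mu, v_context) for mu in cluster_centers]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy global vectors and two cluster centers for "plant" (assumed values).
g = {"soil": [1.0, 0.0], "seed": [0.8, 0.2], "coal": [0.0, 1.0]}
centers = [[1.0, 0.1], [0.1, 1.0]]
print(predict_sense(centers, context_vector(g, ["soil", "seed"])))
print(predict_sense(centers, context_vector(g, ["coal"])))
```

Indices are 0-based here, whereas the text numbers senses from 1.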

Here, the probability that the word $c$ is observed in the context of word
$w_t$ given the sense of the word $w_t$ is

$$P(D = 1 \mid s_t, v_s(w_t, 1), \ldots, v_s(w_t, K), v_g(c)) = P(D = 1 \mid v_s(w_t, s_t), v_g(c)) = \frac{1}{1 + e^{-v_s(w_t, s_t)^T v_g(c)}}$$

The probability of not observing word $c$ in the context of $w_t$ given the
sense of the word $w_t$ is

$$P(D = 0 \mid s_t, v_s(w_t, 1), \ldots, v_s(w_t, K), v_g(c)) = P(D = 0 \mid v_s(w_t, s_t), v_g(c)) = 1 - P(D = 1 \mid v_s(w_t, s_t), v_g(c))$$

Given a training set containing the sequence of word types
$w_1, w_2, \ldots, w_T$, the word embeddings are learned by maximizing the
following objective function:

$$J(\theta) = \sum_{(w_t, c_t) \in D^+} \sum_{c \in c_t} \log P(D = 1 \mid v_s(w_t, s_t), v_g(c)) + \sum_{(w_t, c'_t) \in D^-} \sum_{c' \in c'_t} \log P(D = 0 \mid v_s(w_t, s_t), v_g(c'))$$

Algorithm 1 Training Algorithm of MSSG model
 1: Input: $w_1, w_2, \ldots, w_T$, $d$, $K$, $N$.
 2: Initialize $v_s(w, k)$ and $v_g(w)$, $\forall w \in W, k \in \{1, \ldots, K\}$ randomly,
    $\mu(w, k)$ $\forall w \in W, k \in \{1, \ldots, K\}$ to 0.
 3: for $t = 1, 2, \ldots, T$ do
 4:   $R_t \sim \{1, \ldots, N\}$
 5:   $c_t = \{w_{t-R_t}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+R_t}\}$
 6:   $v_{context}(c_t) = \frac{1}{2 R_t} \sum_{c \in c_t} v_g(c)$
 7:   $s_t = \arg\max_{k = 1, 2, \ldots, K} \{sim(\mu(w_t, k), v_{context}(c_t))\}$
 8:   Update context cluster center $\mu(w_t, s_t)$ since context $c_t$ is
      added to context cluster $s_t$ of word $w_t$.
 9:   $c'_t$ = Noisy_Samples($c_t$)
10:   Gradient update on $v_s(w_t, s_t)$, global vectors of words in $c_t$
      and $c'_t$.
11: end for
12: Output: $v_s(w, k)$, $v_g(w)$ and context cluster centers $\mu(w, k)$,
    $\forall w \in W, k \in \{1, \ldots, K\}$

where $w_t$ is the $t^{th}$ word in the sequence, $c_t$ is the set of
observed context words and $c'_t$ is the set of noisy context words for the
word $w_t$. $D^+$ and $D^-$ are constructed in the same way as in the
Skip-gram model.

After predicting the sense of word $w_t$, we update the embedding of the
predicted sense for the word $w_t$ ($v_s(w_t, s_t)$), the global vector of
the words in the context and the global vector of the randomly sampled, noisy
context words. The context cluster center of cluster $s_t$ for the word $w_t$
($\mu(w_t, s_t)$) is updated since context $c_t$ is added to the cluster
$s_t$.
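The two updates just described can be sketched for a single token as follows. This is a simplified illustration under stated assumptions, not the authors' code: it uses plain stochastic gradient ascent with a fixed learning rate instead of AdaGrad, keeps an explicit count per cluster so the center stays a running mean, and all function and variable names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def update_center(center, count, v_context):
    """Fold one more context vector into a running-mean cluster center."""
    new_count = count + 1
    new_center = [m + (x - m) / new_count for m, x in zip(center, v_context)]
    return new_center, new_count


def sgd_step(v_sense, global_vectors, positives, negatives, lr=0.025):
    """One gradient-ascent step on the per-token MSSG objective (in place).

    For an observed context word the gradient of log P(D=1) with respect to
    the inner product is (1 - sigma); for a noisy word the gradient of
    log P(D=0) is (0 - sigma). Both cases are covered by (label - sigma).
    """
    for words, label in ((positives, 1.0), (negatives, 0.0)):
        for word in words:
            v_c = global_vectors[word]
            score = sigmoid(sum(a * b for a, b in zip(v_sense, v_c)))
            step = lr * (label - score)
            old_sense = list(v_sense)  # use pre-update values for v_c
            for i in range(len(v_sense)):
                v_sense[i] += step * v_c[i]
                v_c[i] += step * old_sense[i]


# Toy example (assumed values): one sense vector, one observed and one
# noisy context word.
v_s = [0.1, 0.2]
g = {"soil": [0.3, 0.1], "coal": [0.2, -0.1]}
sgd_step(v_s, g, positives=["soil"], negatives=["coal"])
center, count = update_center([0.0, 0.0], 0, [1.0, 2.0])
center, count = update_center(center, count, [3.0, 4.0])
print(v_s, center, count)
```

After the step, the sense vector scores the observed word higher and the noisy word lower than before, and the cluster center equals the mean of the two contexts added to it.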

5 Non-Parametric MSSG model 
(NP-MSSG) 

The MSSG model learns a fixed number of senses per word type. In this
section, we describe a non-parametric version of MSSG, the NP-MSSG model,
which learns a varying number of senses per word type. Our approach is
closely related to the online non-parametric clustering procedure described
in Meyerson (2001). We create a new cluster (sense) for a word type with
probability proportional to the distance of its context to the nearest
cluster (sense).

Each word $w \in W$ is associated with sense vectors, context clusters and a
global vector $v_g(w)$ as in the MSSG model. The number of senses for a word
is unknown and is learned during training. Initially, the words do not have
sense vectors and context clusters. We create the first sense vector and
context cluster for each word on its first occurrence in the training data.
After creating the first context cluster for a word, a new context cluster
and a sense vector are created online during training when the word is
observed with a context where the similarity between the vector
representation of the context and every existing cluster center of the word
is less than $\lambda$, where $\lambda$ is a hyperparameter of the model.

Consider the word $w_t$ and let
$c_t = \{w_{t-R_t}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+R_t}\}$ be the set
of observed context words. The vector representation of the context is
defined as the average of the global vector representation of the words in
the context. Let
$v_{context}(c_t) = \frac{1}{2 R_t} \sum_{c \in c_t} v_g(c)$ be the vector
representation of the context $c_t$. Let $k(w_t)$ be the number of context
clusters or the number of senses currently associated with word $w_t$. $s_t$,
the sense of word $w_t$ when $k(w_t) > 0$, is given by

$$s_t = \begin{cases} k(w_t) + 1, & \text{if } \max_{k = 1, 2, \ldots, k(w_t)} \{sim(\mu(w_t, k), v_{context}(c_t))\} < \lambda \\ k_{max}, & \text{otherwise} \end{cases} \qquad (4)$$

where $\mu(w_t, k)$ is the cluster center of the $k^{th}$ cluster of word
$w_t$ and
$k_{max} = \arg\max_{k = 1, 2, \ldots, k(w_t)} sim(\mu(w_t, k), v_{context}(c_t))$.

The cluster center is the average of the vector representations of all the
contexts which belong to that cluster. If $s_t = k(w_t) + 1$, a new context
cluster and a new sense vector are created for the word $w_t$.
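The thresholded assignment in Equation 4 can be sketched as below. The function name is hypothetical and indices are 0-based. Initializing a new cluster center with the triggering context vector is an assumption of this sketch; it matches the definition of the center as the average of the contexts in the cluster, which at creation time is that single context. Creating the matching sense vector is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def assign_sense_np(cluster_centers, v_context, lam=-0.5):
    """Return (sense index, created?) for one token of a word.

    A new cluster is appended when the word has no cluster yet (first
    occurrence) or when the best similarity falls below the threshold lam.
    """
    if not cluster_centers:
        cluster_centers.append(list(v_context))
        return 0, True
    sims = [cosine(mu, v_context) for mu in cluster_centers]
    k_max = max(range(len(sims)), key=sims.__getitem__)
    if sims[k_max] < lam:
        cluster_centers.append(list(v_context))
        return len(cluster_centers) - 1, True
    return k_max, False


# Toy contexts for a single word (assumed values).
centers = []
print(assign_sense_np(centers, [1.0, 0.0]))    # first occurrence
print(assign_sense_np(centers, [0.9, 0.1]))    # close to existing sense
print(assign_sense_np(centers, [-1.0, -0.1]))  # far from every sense
```

With the threshold of -0.5 used in this sketch's default, only contexts pointing roughly opposite to every existing center open a new sense.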

The NP-MSSG model and the MSSG model described previously differ only in the
way word sense discrimination is performed. The objective function and the
probabilistic model associated with observing a (word, context) pair given
the sense of the word remain the same.




Model             Time (in hours)
Huang et al.      168
MSSG-50d          1
MSSG-300d         6
NP-MSSG-50d       1.83
NP-MSSG-300d      5
Skip-gram-50d     0.33
Skip-gram-300d    1.5

Table 1: Training time results. The first five models reported in the table
are capable of learning multiple embeddings for each word, while Skip-gram is
capable of learning only a single embedding for each word.

6 Experiments 

To evaluate our algorithms we train embeddings using the same corpus and
vocabulary as used in Huang et al. (2012), which is the April 2010 snapshot
of the Wikipedia corpus (Shaoul and Westbury, 2010). It contains
approximately 2 million articles and 990 million tokens. In all our
experiments we remove all the words with fewer than 20 occurrences and use a
maximum context window ($N$) of length 5 (5 words before and after the word
occurrence). We fix the number of senses ($K$) to be 3 for the MSSG model
unless otherwise specified. Our hyperparameter values were selected by a
small amount of manual exploration on a validation set. In NP-MSSG we set
$\lambda$ to -0.5. The Skip-gram model, MSSG and NP-MSSG models sample one
noisy context word ($S$) for each of the observed context words. We train our
models using AdaGrad stochastic gradient descent (Duchi et al., 2011) with
the initial learning rate set to 0.025. Similarly to Huang et al. (2012), we
don’t use a regularization penalty.

Below we describe qualitative results, displaying the embeddings and the
nearest neighbors of each word sense, and quantitative experiments in two
benchmark word similarity tasks.

Table 1 shows the time to train our models, compared with other models from
previous work. All these times are from single-machine implementations
running on similar-sized corpora. We see that our model shows significant
improvement in the training time over the model in Huang et al. (2012), being
well within an order of magnitude of the training time for Skip-gram models.


Apple
  Skip-gram   blackberry, macintosh, acorn, pear, plum
  MSSG        pear, honey, pumpkin, potato, nut
              microsoft, activision, sony, retail, gamestop
              macintosh, pc, ibm, iigs, chipsets
  NP-MSSG     apricot, blackberry, cabbage, blackberries, pear
              microsoft, ibm, wordperfect, amiga, trs-80

Fox
  Skip-gram   abc, nbc, soapnet, espn, kttv
  MSSG        beaver, wolf, moose, otter, swan
              nbc, espn, cbs, ctv, pbs
              dexter, myers, sawyer, kelly, griffith
  NP-MSSG     rabbit, squirrel, wolf, badger, stoat
              cbs, abc, nbc, wnyw, abc-tv

Net
  Skip-gram   profit, dividends, pegged, profits, nets
  MSSG        snap, sideline, ball, game-trying, scoring
              negative, offset, constant, hence, potential
              pre-tax, billion, revenue, annualized, us$
  NP-MSSG     negative, total, transfer, minimizes, loop
              pre-tax, taxable, per, billion, us$, income
              ball, yard, fouled, bounced, 50-yard
              wnet, tvontorio, cable, tv, tv-5

Rock
  Skip-gram   glam, indie, punk, band, pop
  MSSG        rocks, basalt, boulders, sand, quartzite
              alternative, progressive, roll, indie, blues-rock
              rocks, pine, rocky, butte, deer
  NP-MSSG     granite, basalt, outcropping, rocks, quartzite
              alternative, indie, pop/rock, rock/metal, blues-rock

Run
  Skip-gram   running, ran, runs, afoul, amok
  MSSG        running, stretch, ran, pinch-hit, runs
              operated, running, runs, operate, managed
              running, runs, operate, drivers, configure
  NP-MSSG     two-run, walk-off, runs, three-runs, starts
              operated, runs, serviced, links, walk
              running, operating, ran, go, configure
              re-election, reelection, re-elect, unseat, term-limited
              helmed, longest-running, mtv, promoted, produced

Table 2: Nearest neighbors of each sense of each word, by cosine similarity,
for different algorithms. Each line under MSSG and NP-MSSG corresponds to one
learned sense. Note that the different senses closely correspond to
intuitions regarding the senses of the given word types.


6.1 Nearest Neighbors 

Table 2 shows qualitatively the results of discovering multiple senses by
presenting the nearest neighbors associated with various embeddings. The
nearest neighbors of a word are computed by comparing the cosine similarity
between the embedding for each sense of the word and the context embeddings
of all other words in the vocabulary. Note that each of the discovered senses
is indeed semantically coherent, and that a reasonable number of senses are
created by the non-parametric method. Table 3 shows the nearest neighbors of
the word plant for Skip-gram, MSSG, NP-MSSG and Huang’s model (Huang et al.,
2012).





Skip-gram     plants, flowering, weed, fungus, biomass

MSSG          plants, tubers, soil, seed, biomass
              refinery, reactor, coal-fired, factory, smelter
              asteraceae, fabaceae, arecaceae, lamiaceae, ericaceae

NP-MSSG       plants, seeds, pollen, fungal, fungus
              factory, manufacturing, refinery, bottling, steel
              fabaceae, legume, asteraceae, apiaceae, flowering
              power, coal-fired, hydro-power, hydroelectric, refinery

Huang et al.  insect, capable, food, solanaceous, subsurface
              robust, belong, pitcher, comprises, eagles
              food, animal, catching, catch, ecology, fly
              seafood, equipment, oil, dairy, manufacturer
              facility, expansion, corporation, camp, co.
              treatment, skin, mechanism, sugar, drug
              facility, theater, platform, structure, storage
              natural, blast, energy, hurl, power
              matter, physical, certain, expression, agents
              vine, mute, chalcedony, quandong, excrete

Table 3: Nearest neighbors of the word plant for different models. We see
that the discovered senses in both our models are more semantically coherent
than those of Huang et al. (2012), and NP-MSSG is able to learn a reasonable
number of senses.

6.2 Word Similarity 

We evaluate our embeddings on two related datasets: the WordSim-353
(Finkelstein et al., 2001) dataset and the Contextual Word Similarities
(SCWS) dataset (Huang et al., 2012).

WordSim-353 is a standard dataset for evaluating word vector representations.
It consists of a list of pairs of word types, the similarity of which is
rated on an integral scale from 1 to 10. Pairs include both monosemic and
polysemic words. The scores for each word pair are given without any
contextual information, which makes them tricky to interpret.

To overcome this issue, Stanford’s Contextual Word Similarities (SCWS)
dataset was developed by Huang et al. (2012). The dataset consists of 2003
word pairs and their sentential contexts. It consists of 1328 noun-noun
pairs, 399 verb-verb pairs, 140 verb-noun, 97 adjective-adjective, 30
noun-adjective, 9 verb-adjective, and 241 same-word pairs. We evaluate and
compare our embeddings on both the WordSim-353 and SCWS word similarity
corpora.

Since it is not trivial to deal with multiple embeddings per word, we
consider the following similarity measures between words $w$ and $w'$ given
their respective contexts $c$ and $c'$, where $P(w, c, k)$ is the probability
that $w$ takes the $k^{th}$ sense given the context $c$, and
$d(v_s(w, i), v_s(w', j))$ is the similarity measure between the given
embeddings $v_s(w, i)$ and $v_s(w', j)$.

The avgSim metric,

$$\text{avgSim}(w, w') = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} d(v_s(w, i), v_s(w', j)),$$

computes the average similarity over all embeddings for each word, ignoring
information from the context.

To address this, the avgSimC metric,

$$\text{avgSimC}(w, w') = \sum_{j=1}^{K} \sum_{i=1}^{K} P(w, c, i) P(w', c', j) \times d(v_s(w, i), v_s(w', j)),$$

weighs the similarity between each pair of senses by how well each sense fits
the context at hand.

The globalSim metric uses each word’s global context vector, ignoring the
many senses:

$$\text{globalSim}(w, w') = d(v_g(w), v_g(w')).$$

Finally, the localSim metric selects a single sense for each word based
independently on its context and computes the similarity by

$$\text{localSim}(w, w') = d(v_s(w, k), v_s(w', k')),$$

where $k = \arg\max_i P(w, c, i)$ and $k' = \arg\max_j P(w', c', j)$, and
$P(w, c, i)$ is the probability that $w$ takes the $i^{th}$ sense given
context $c$. The probability of being in a cluster is calculated as the
inverse of the cosine distance to the cluster center (Huang et al., 2012).
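The four measures can be sketched side by side, taking $d$ to be cosine similarity. All names are hypothetical. Several details are assumptions of this sketch: the sense probabilities are obtained by normalizing the inverse cosine distances so they sum to one, a small `eps` guards against division by zero, and avgSim divides by the product of the two sense counts, which reduces to $K^2$ when both words have $K$ senses.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def sense_probs(centers, v_context, eps=1e-6):
    """P(w, c, k): normalized inverse cosine distance to each center."""
    inv = [1.0 / (1.0 - cosine(mu, v_context) + eps) for mu in centers]
    z = sum(inv)
    return [x / z for x in inv]


def avg_sim(senses_w, senses_w2):
    """Average similarity over all pairs of sense vectors."""
    total = sum(cosine(u, v) for u in senses_w for v in senses_w2)
    return total / (len(senses_w) * len(senses_w2))


def avg_sim_c(senses_w, probs_w, senses_w2, probs_w2):
    """Pairwise similarities weighted by the two sense probabilities."""
    return sum(p * q * cosine(u, v)
               for u, p in zip(senses_w, probs_w)
               for v, q in zip(senses_w2, probs_w2))


def global_sim(global_w, global_w2):
    """Similarity of the two global vectors, ignoring senses."""
    return cosine(global_w, global_w2)


def local_sim(senses_w, probs_w, senses_w2, probs_w2):
    """Similarity of the single most probable sense of each word."""
    k = max(range(len(probs_w)), key=probs_w.__getitem__)
    k2 = max(range(len(probs_w2)), key=probs_w2.__getitem__)
    return cosine(senses_w[k], senses_w2[k2])


# Two words with two orthogonal senses each (assumed toy values).
senses_a = [[1.0, 0.0], [0.0, 1.0]]
senses_b = [[1.0, 0.0], [0.0, 1.0]]
print(avg_sim(senses_a, senses_b))
print(avg_sim_c(senses_a, [1.0, 0.0], senses_b, [1.0, 0.0]))
print(local_sim(senses_a, [0.9, 0.1], senses_b, [0.2, 0.8]))
```

The toy example shows why context matters: the context-free average sits halfway between the matching and the mismatching sense pairs, while the context-aware measures follow whichever senses the contexts select.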

We report the Spearman correlation between a model’s similarity scores and
the human judgements in the datasets.
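For reference, the Spearman correlation is the Pearson correlation of the rank-transformed scores. The sketch below is a plain-Python illustration with hypothetical function names; assigning tied values their average rank is an assumption here, as the text does not state how ties are treated.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Model scores vs. human ratings for three word pairs (assumed toy values).
model_scores = [0.1, 0.4, 0.9]
human_ratings = [1.0, 5.0, 7.0]
print(spearman(model_scores, human_ratings))
```

Only the ordering of the scores matters, so a model is not penalized for producing similarities on a different scale than the human ratings.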

Table 5 shows the results on the WordSim-353 task. C&W refers to the language
model by Collobert and Weston (2008) and the HLBL model is the method
described in Mnih and Hinton (2007). On the WordSim-353 task, we see that our
model performs significantly better than the previous neural network model
for learning multiple representations per word (Huang et al., 2012). Among
the methods that learn low-dimensional and dense representations, our model
performs slightly better than Skip-gram. Table 4 shows the results for the
SCWS task. In this task, when the words are



Model                     globalSim   avgSim   avgSimC   localSim
TF-IDF                    26.3        -        -         -
Collobert & Weston-50d    57.0        -        -         -
Skip-gram-50d             63.4        -        -         -
Skip-gram-300d            65.2        -        -         -
Pruned TF-IDF             62.5        60.4     60.5      -
Huang et al.-50d          58.6        62.8     65.7      26.1
MSSG-50d                  62.1        64.2     66.9      49.17
MSSG-300d                 65.3        67.2     69.3*     57.26
NP-MSSG-50d               62.3        64.0     66.1      50.27
NP-MSSG-300d              65.5*       67.3*    69.1      59.80*

Table 4: Experimental results in the SCWS task. The numbers are Spearman’s
correlation ρ × 100 between each model’s similarity judgments and the human
judgments, in context. First three models learn only a single embedding per
model and hence, avgSim, avgSimC and localSim are not reported for these
models, as they’d be identical to globalSim. Both our parametric and
non-parametric models outperform the baseline models, and our best model
achieves a score of 69.3 in this task. NP-MSSG achieves the best results when
globalSim, avgSim and localSim similarity measures are used. The best results
according to each metric are marked with *.


Model               ρ × 100
HLBL                33.2
C&W                 55.3
Skip-gram-300d      70.4
Huang et al.-G      22.8
Huang et al.-M      64.2
MSSG 50d-G          60.6
MSSG 50d-M          63.2
MSSG 300d-G         69.2
MSSG 300d-M         70.9*
NP-MSSG 50d-G       61.5
NP-MSSG 50d-M       62.4
NP-MSSG 300d-G      69.1
NP-MSSG 300d-M      68.6
Pruned TF-IDF       73.4
ESA                 75
Tiered TF-IDF       76.9

Table 5: Results on the WordSim-353 dataset. The table shows the Spearman’s
correlation ρ between the model’s similarities and human judgments. G
indicates the globalSim similarity measure and M indicates the avgSim
measure. The best result among models that learn low-dimensional and dense
representations is marked with *. Pruned TF-IDF (Reisinger and Mooney,
2010a), ESA (Gabrilovich and Markovitch, 2007) and Tiered TF-IDF (Reisinger
and Mooney, 2010b) construct sparse, high-dimensional representations.



Figure 3: The plot shows the distribution of the number of senses learned per
word type in the NP-MSSG model.

given with their context, our model achieves new state-of-the-art results on
SCWS as shown in Table 4. The previous state-of-the-art model (Huang et al.,
2012) on this task achieves 65.7% using the avgSimC measure, while the MSSG
model achieves the best score of 69.3% on this task. The results on the other
metrics are similar. For a fixed embedding dimension, the model by Huang et
al. (2012) has more parameters than our model since it uses a hidden layer.
The results show that our model performs better than Huang et al. (2012) even
when both models use 50-dimensional vectors, and the performance of our model
improves as we increase the number of dimensions to 300.

We evaluate the models in a word analogy task
Figure 4: Effect of varying the embedding dimensionality of the MSSG model on
the SCWS task (panel title: Number of Senses = 3).

Figure 5: Effect of varying the number of senses of the MSSG model on the
SCWS task.














Model       Task      Sim         ρ × 100
Skip-gram   WS-353    globalSim   70.4
MSSG        WS-353    globalSim   68.4
MSSG        WS-353    avgSim      71.2
NP-MSSG     WS-353    globalSim   68.3
NP-MSSG     WS-353    avgSim      69.66
MSSG        SCWS      localSim    59.3
MSSG        SCWS      globalSim   64.7
MSSG        SCWS      avgSim      67.2
MSSG        SCWS      avgSimC     69.2
NP-MSSG     SCWS      localSim    60.11
NP-MSSG     SCWS      globalSim   65.3
NP-MSSG     SCWS      avgSim      67
NP-MSSG     SCWS      avgSimC     68.6

Table 6: Experiment results on the WordSim-353 and SCWS tasks. Multiple
embeddings are learned for the top 30,000 most frequent words in the
vocabulary. The embedding dimension size is 300 for all the models for this
task. The number of senses for the MSSG model is 3.

introduced by Mikolov et al. (2013a), where both the MSSG and NP-MSSG models
achieve 64% accuracy compared to 12% accuracy by Huang et al. (2012).
Skip-gram, which is the state-of-the-art model for this task, achieves 67%
accuracy.

Figure 3 shows the distribution of the number of senses learned per word type
in the NP-MSSG model. We learn the multiple embeddings for the same set of
approximately 6000 words that were used in Huang et al. (2012) for all our
experiments to ensure fair comparison. These approximately 6000 words were
chosen by Huang et al. mainly from the top 30,000 frequent words in the
vocabulary. This selection was likely made to avoid the noise of learning
multiple senses for infrequent words. However, our method is robust to noise,
which can be seen by the good performance of our model that learns multiple
embeddings for the top 30,000 most frequent words. We found that even by
learning multiple embeddings for the top 30,000 most frequent words in the
vocabulary, the MSSG model still achieves a state-of-the-art result on the
SCWS task with an avgSimC score of 69.2 as shown in Table 6.

7 Conclusion 

We present an extension to the Skip-gram model that efficiently learns
multiple embeddings per word type. The model jointly performs word sense
discrimination and embedding learning, and non-parametrically estimates the
number of senses per word type. Our method achieves new state-of-the-art
results in the word similarity in context task and learns multiple senses,
training on close to a billion tokens in less than 6 hours. The global
vectors, sense vectors and cluster centers of our model and code for learning
them are available at
https://people.cs.umass.edu/~arvind/emnlp2014wordvectors. In future work we
plan to use the multiple embeddings per word type in downstream NLP tasks.